EXPLORATORY DATA ANALYSIS by OLAF WIED

The data set contains 4,898 white wines with 11 variables quantifying the chemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).

Reference: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

Input variables (based on physicochemical tests):

  1. fixed acidity (tartaric acid - g / dm^3)
  2. volatile acidity (acetic acid - g / dm^3)
  3. citric acid (g / dm^3)
  4. residual sugar (g / dm^3)
  5. chlorides (sodium chloride - g / dm^3)
  6. free sulfur dioxide (mg / dm^3)
  7. total sulfur dioxide (mg / dm^3)
  8. density (g / cm^3)
  9. pH
  10. sulphates (potassium sulphate - g / dm^3)
  11. alcohol (% by volume)

Output variable (based on sensory data):

  1. quality (score between 0 and 10)

Univariate Plots Section

We use ggpairs on a subsample:

## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

Univariate Analysis

What is the structure of your dataset? Did you create any new variables from existing variables in the dataset?

We notice that the vast majority of wines were assigned a rating between 5 and 7. There are no wines with ratings of 0, 1, 2 or 10. It might be useful to combine the ratings and form 3 groups: [3,4], [5,7] and [8,9]. We use “cut” to create a new variable “quality.joined”.
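A minimal sketch of this grouping step, assuming the data frame is called wwq (the name that appears in the output further below) and that the breaks are 0, 4, 7 and 10, which is what the group labels (0,4], (4,7] and (7,10] in the printed rows suggest. The seven ratings used here are only the distinct values that occur in the data, not the full dataset.

```r
# One wine per observed rating, for illustration only (not the real data).
wwq <- data.frame(quality = 3:9)

# cut() builds right-closed intervals by default, so a rating of 4 falls
# into (0,4] and a rating of 7 falls into (4,7].
wwq$quality.joined <- cut(wwq$quality, breaks = c(0, 4, 7, 10))

# Number of wines per group.
table(wwq$quality.joined)
```

Because the intervals are right-closed, the three factor levels correspond to the groups [3,4], [5,7] and [8,9] for the integer ratings present in the data.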

What is/are the main feature(s) of interest in your dataset?

We are interested in identifying the chemical properties of the white wines that could have influenced the quality rating. We will try to detect relationships between the rating (variable “quality”) and the variables describing the chemical properties.

From the pairwise plots we can get an overview of the data:

  • There is no clear (linear) relationship between the wine quality and any other variable.
  • There appear to be several outliers (on the upper scale).

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

There are a few obvious interdependencies between other variables (e.g. alcohol and density, residual.sugar and density). These might also help to eliminate outliers. Further, (high) quality is probably not influenced by a single variable but rather by an (optimal?) combination of chemical properties.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Most histograms show very symmetric (Gaussian) behaviour with a few potential outliers. Alcohol and residual sugar are a little more skewed. Chlorides is also very symmetric and peaked around 0.04 but shows quite a few values above 0.1.

For some variables, we probably want to delete some outliers. This will be investigated next.

Bivariate Plots Section

##         X fixed.acidity volatile.acidity citric.acid residual.sugar
## 1654 1654           7.9            0.330        0.28           31.6
## 1664 1664           7.9            0.330        0.28           31.6
## 2782 2782           7.8            0.965        0.60           65.8
##      chlorides free.sulfur.dioxide total.sulfur.dioxide density   pH
## 1654     0.053                  35                  176 1.01030 3.15
## 1664     0.053                  35                  176 1.01030 3.15
## 2782     0.074                   8                  160 1.03898 3.39
##      sulphates alcohol quality quality.joined
## 1654      0.38     8.8       6          (4,7]
## 1664      0.38     8.8       6          (4,7]
## 2782      0.69    11.7       6          (4,7]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

While the high density of 1.03898 is an outlier, it is plausible as it has an extremely high amount of residual sugar.

However, with increasing residual sugar the influence of other variables should become weaker. We therefore discard the wine with the highest residual sugar.
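The inspection and removal of this wine could look roughly as follows. This is a sketch, not necessarily the code used for the report: the demo rows are copied from the output above (only the columns needed here), and the threshold of 30 g/dm^3 is an assumption chosen so that exactly those three rows are selected.

```r
# Demo rows copied from the printed output above; not the full dataset.
wwq <- data.frame(
  X              = c(1654, 1664, 2782),
  residual.sugar = c(31.6, 31.6, 65.8),
  density        = c(1.01030, 1.01030, 1.03898)
)

# Inspect the wines with unusually high residual sugar.
# The cut-off of 30 is an assumption made for this sketch.
subset(wwq, residual.sugar > 30)

# Drop the single wine with the highest residual sugar.
wwq <- wwq[-which.max(wwq$residual.sugar), ]
```

Using which.max() removes exactly one row, even if the maximum value were to occur more than once.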

High quality wines tend to have higher percentages of alcohol.

Residual sugar doesn’t show a clear influence on wine quality. As with many other variables, a certain level of a chemical can result in very different quality ratings.

Next, let’s look at acidity. We expect low ratings for high values, as too much acidity leads to a vinegary taste.

We see that for values up to 0.7, there is no clear influence on the wine quality. Ratings seem to decrease for values higher than 0.8. But there are only a few data points, so we cannot be sure about a true correlation. Nevertheless, it makes sense to break acidity into groups, especially because we can assume that reaching a certain high level (maybe not reached in the dataset) will eventually have a bad influence on the wine quality:

Next, we investigate the second acidity variable, fixed acidity:

We delete the data point with the highest fixed acidity because it is the only wine with an acidity in this range.

One more variable about acidity: citric acid

The investigation shows an overlap in quality for different levels of citric acidity. High quality seems to be associated with a smaller range of citric acidity.

Now, we turn our attention to chlorides and sulphates:

## wwq$quality.joined: (0,4]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01300 0.03750 0.04600 0.05056 0.05400 0.29000 
## -------------------------------------------------------- 
## wwq$quality.joined: (4,7]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04588 0.05000 0.34600 
## -------------------------------------------------------- 
## wwq$quality.joined: (7,10]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01400 0.03000 0.03550 0.03801 0.04400 0.12100
##         X fixed.acidity volatile.acidity citric.acid residual.sugar
## 2523 2523           7.3             0.17        0.24            8.1
## 2526 2526           7.3             0.17        0.24            8.1
##      chlorides free.sulfur.dioxide total.sulfur.dioxide density   pH
## 2523     0.121                  32                  162 0.99508 3.17
## 2526     0.121                  32                  162 0.99508 3.17
##      sulphates alcohol quality quality.joined
## 2523      0.38    10.4       8         (7,10]
## 2526      0.38    10.4       8         (7,10]
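Per-group summaries in the format shown above are what by() prints when it is combined with summary(). The following sketch assumes that this is how the output was produced. The demo rows are copied from the printed output in this report, so two of the three quality groups are populated and the (0,4] group is empty.

```r
# Demo rows copied from printed output in this report; not the full dataset.
wwq <- data.frame(
  chlorides = c(0.053, 0.053, 0.074, 0.121, 0.121),
  quality   = c(6, 6, 6, 8, 8)
)
wwq$quality.joined <- cut(wwq$quality, breaks = c(0, 4, 7, 10))

# Summary statistics of chlorides within each quality group.
chloride.summary <- by(wwq$chlorides, wwq$quality.joined, summary)
chloride.summary

# Rows in the top group that carry the largest chloride value of that group.
top <- wwq[wwq$quality.joined == "(7,10]", ]
top.max <- top[top$chlorides == max(top$chlorides), ]
top.max
```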

No apparent findings on sulphates. Next, sulfur dioxide:

## Warning in loop_apply(n, do.ply): Removed 8 rows containing missing values
## (stat_smooth).
## Warning in loop_apply(n, do.ply): Removed 8 rows containing missing values
## (geom_point).

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Wine quality doesn’t seem to vary in any simple way with any single variable describing a specific chemical attribute. For most plots, we notice that low quality wines show a wider range of values (containing the smaller range of high quality wines). While it might be hard to determine what makes a high quality wine, this could help determine when a chemical property becomes so extreme that it results in a bad taste. (The best example would be acidity.) In some cases, we might be able to construct a weak linear relationship by transforming a variable, see sulfur above.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

We used the physical (nearly linear) relationship between alcohol/sugar content and density to determine extreme values.

What was the strongest relationship you found?

The strongest (and also most obvious) relationship is the one between residual sugar and density. Also, alcohol and density are strongly correlated, even though residual sugar has the stronger influence (as “adding” alcohol can only lower the density to the density of alcohol itself).
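One way to quantify such relationships is a Pearson correlation matrix. The sketch below only shows the mechanics: the five demo rows are copied from printed output in this report and include the extreme-sugar wine, so the numbers it prints are not the report's results and should not be read as such. On the full data frame the same call would be applied to the corresponding columns.

```r
# Demo rows copied from printed output in this report; not representative.
wwq <- data.frame(
  residual.sugar = c(31.6, 31.6, 65.8, 8.1, 8.1),
  alcohol        = c(8.8, 8.8, 11.7, 10.4, 10.4),
  density        = c(1.01030, 1.01030, 1.03898, 0.99508, 0.99508)
)

# Pairwise Pearson correlations between the three variables.
cor.matrix <- cor(wwq)
round(cor.matrix, 2)
```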

Relationships between chemicals and wine quality are rather weak.

Multivariate Plots Section

The first plot is similar to the one above (after eliminating the outlier).

The scatter plot of fixed and volatile acidity supports our hypothesis that excessively high acidity (for at least one of the two variables) might be correlated with lower scores (red).

We create two new variables:

  • acidity.total: volatile.acidity * fixed.acidity
  • sulfur.citrc: free.sulfur.dioxide * citric.acid (antioxidants and “freshness”)
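The two derived variables are plain element-wise products, which can be sketched as follows. The demo rows are copied from printed output in this report and contain only the four columns needed here.

```r
# Demo rows copied from printed output in this report; not the full dataset.
wwq <- data.frame(
  fixed.acidity       = c(7.9, 7.8, 7.3),
  volatile.acidity    = c(0.330, 0.965, 0.170),
  citric.acid         = c(0.28, 0.60, 0.24),
  free.sulfur.dioxide = c(35, 8, 32)
)

# Product of the two acidity measures.
wwq$acidity.total <- wwq$volatile.acidity * wwq$fixed.acidity

# Product of free sulfur dioxide and citric acid ("freshness").
wwq$sulfur.citrc <- wwq$free.sulfur.dioxide * wwq$citric.acid

wwq[, c("acidity.total", "sulfur.citrc")]
```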

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

In general, there are no clear correlations between wine quality and its chemical properties. Our visualizations suggest that only extreme values (e.g. for acidity) may influence the rating in a negative way. The dataset is problematic as most wines are assigned a moderate rating and not much can be inferred about low or high ratings (high ratings, of course, are of particular interest).

Were there any interesting or surprising interactions between features?

No. Strong relationships can only be found among chemical attributes (e.g. density and residual sugar). No surprises here.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

We created a simple tree model. However, relationships are so weak that the model only makes use of two variables (alcohol and volatile acidity) and only assigns ratings of either 5 or 6.


Final Plots and Summary

Plot One

Description One

Most white wines obtain a rating between 5 and 7. Only very few ratings of 3 and 4 or 8 and 9 are assigned. There are no ratings less than 3 and no wine is rated 10. As most wines are of medium quality, it will be hard to determine which chemical properties are typical of high quality wines (if this is possible in the first place).

Plot Two

Description Two

The regression tree model results in a very simplistic structure with only two distinctions: if the alcohol level is below 10.85% and the volatile acidity is higher than 0.2525 g/dm^3, the wine is assigned a rating of 5 (upper left area). In all other cases its rating is 6. We added color and let the size of the dots increase with quality. The wines in the upper left rectangle appear to be (on average) of lower quality than the ones outside the rectangle. In fact, the tree method gives a mean of 5.361 for the wines in the upper left area and a mean of 6.131 elsewhere. Nevertheless, this is a very disappointing result. In general, no strong relationships between wine quality and its chemical properties could be found.
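The decision rule described here can be written down directly. The helper below is hypothetical: it hard-codes the two thresholds quoted in the text rather than refitting the tree, and the three demo wines are taken from rows printed earlier in this report.

```r
# Hypothetical helper that encodes the two splits quoted in the text.
# It does not fit a model; the thresholds are copied from the description.
tree.rating <- function(alcohol, volatile.acidity) {
  ifelse(alcohol < 10.85 & volatile.acidity > 0.2525, 5, 6)
}

# Demo wines copied from rows printed earlier (X = 1654, 2782, 2523).
demo <- data.frame(
  alcohol          = c(8.8, 11.7, 10.4),
  volatile.acidity = c(0.330, 0.965, 0.170),
  quality          = c(6, 6, 8)
)
demo$predicted <- tree.rating(demo$alcohol, demo$volatile.acidity)
demo
```

The third wine is rated 8 by the experts but receives a 6 from this rule, which illustrates how little of the variation in quality two splits can capture.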

Plot Three

Description Three

We combine several variables to get a more complete view of the data: On the x-axis we multiply the level of sulfur dioxide (an antioxidant) with the amount of citric acid (normalized per dm^3) as a measure of freshness. On the y-axis we multiply fixed and volatile acidity levels, which in high doses can lead to a vinegary taste. Further, we use the result from the tree regression model, which suggests discriminating wine quality based on the alcohol content. The color distinction shows that high quality wines (the right facet) are more often found at alcohol levels over 10.85%. The opposite holds for wines of quality 3 or 4. Also, high quality is more often found with high “freshness” (keep in mind the log scale on the x-axis). Most wines have low “acidity” levels. However, higher levels are more often found for medium or low quality wines.


Reflection

The dataset contains almost 5000 white wines, each rated by at least three experts. Eleven chemical attributes such as sulfur content and pH level are listed.

Only weak relationships between the quality and the chemical attributes could be found. This is hardly surprising because we cannot expect to model human taste (a very complex and poorly understood sense) with only eleven variables. Some variables are strongly correlated (e.g. density with alcohol content or with the amount of residual sugar). A tree regression model was applied but could provide only little insight. One problem might be that most wines are of medium quality (with ratings between 5 and 7). The dataset contains only a few wines with ratings of 8 and 9. This makes it hard to make inferences about high quality wines. Also, combinations of variables could only slightly improve the situation. What we can say is the rather trivial fact that certain chemicals in very high doses (e.g. volatile acidity) are likely to have a negative influence on the taste. If a wine shows only moderate chemical attributes, nothing can be said about its potential rating. As a guideline, our visualizations showed the following:

might be more likely associated with high ratings, however, these are far from being sufficient criteria.

For a better understanding of wine quality more chemical properties are needed.